We present a physics-based inverse rendering method that learns the illumination, geometry, and materials of a scene from posed multi-view RGB images. To model the illumination of a scene, existing inverse rendering works either ignore indirect illumination entirely or model it with coarse approximations, leading to sub-optimal predictions of the scene's illumination, geometry, and materials. In this work, we propose a physics-based illumination model that explicitly traces the incoming indirect light at each surface point based on interreflection, and then estimates each identified indirect light with an efficient neural network. Furthermore, we use Leibniz's integral rule to resolve the non-differentiability that one type of environment light, tangent lights, introduces into the proposed illumination model. As a result, the proposed interreflection-aware illumination model can be learned end-to-end together with geometry and material estimation. As a byproduct, our physics-based inverse rendering model also facilitates flexible and realistic material editing as well as relighting. Extensive experiments on both synthetic and real-world datasets demonstrate that the proposed method performs favorably against existing inverse rendering methods on novel view synthesis and inverse rendering.
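For readers who want the equations the abstract alludes to: below is a standard statement of the rendering integral with the direct/indirect split, and of Leibniz's integral rule, whose boundary terms are what restore differentiability when the integration domain itself moves with the parameter (as with tangent lights grazing the visible hemisphere). The notation is ours, not necessarily the paper's.

```latex
% Outgoing radiance at surface point x: emission plus the hemispherical
% integral of direct and interreflected (indirect) incoming radiance.
L_o(\mathbf{x}, \omega_o) = L_e(\mathbf{x}, \omega_o)
  + \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o)\,
    \bigl[L_{\mathrm{dir}}(\mathbf{x}, \omega_i) + L_{\mathrm{ind}}(\mathbf{x}, \omega_i)\bigr]
    (\omega_i \cdot \mathbf{n})\,\mathrm{d}\omega_i

% Leibniz's integral rule: when the limits a(\theta), b(\theta) depend on
% the parameter being differentiated, two boundary terms appear in
% addition to the differentiated integrand.
\frac{\mathrm{d}}{\mathrm{d}\theta}\int_{a(\theta)}^{b(\theta)} f(t,\theta)\,\mathrm{d}t
  = f\bigl(b(\theta),\theta\bigr)\,b'(\theta)
  - f\bigl(a(\theta),\theta\bigr)\,a'(\theta)
  + \int_{a(\theta)}^{b(\theta)} \frac{\partial f}{\partial\theta}(t,\theta)\,\mathrm{d}t
```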
Mobile-centric AI applications place high demands on the resource efficiency of model inference. Input filtering is a promising approach to eliminate redundancy and thereby reduce inference cost. Previous efforts have tailored effective solutions for many applications, but two essential questions remain unanswered: (1) the theoretical filterability of an inference workload, which would guide the application of input filtering techniques and spare resource-constrained mobile applications the cost of trial and error; and (2) robust discriminability of feature embeddings, which would allow input filtering to be effective across diverse inference tasks and input content. To answer them, we first formalize the input filtering problem and theoretically compare the hypothesis complexities of inference models and input filters to understand the optimization potential. We then propose the first end-to-end learnable input filtering framework, which covers most state-of-the-art methods and surpasses them with feature embeddings of robust discriminability. We design and implement InFi, which supports six input modalities and multiple mobile-centric deployments. Comprehensive evaluations confirm our theoretical results and show that InFi outperforms strong baselines in applicability, accuracy, and efficiency. InFi achieves 8.5x throughput and saves 95% of bandwidth while maintaining over 90% accuracy for a video analytics application on mobile platforms.
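As a reading aid, here is a minimal sketch of the input-filtering pattern the abstract describes: a cheap learnable filter embeds each input and skips the heavy model when the input looks redundant. `FilterNet`, the 0.5 threshold, and the result-reuse policy are our illustrative assumptions, not the InFi implementation.

```python
import torch
import torch.nn as nn

class FilterNet(nn.Module):
    """Lightweight filter: a much smaller hypothesis class than the model."""
    def __init__(self, in_dim: int, emb_dim: int = 32):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.ReLU())
        self.head = nn.Linear(emb_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Probability that this input deserves full inference.
        return torch.sigmoid(self.head(self.embed(x)))

def filtered_inference(x, filter_net, heavy_model, last_result):
    """Run the expensive model only when the filter passes the input."""
    if filter_net(x).item() < 0.5:   # judged redundant
        return last_result           # reuse the previous output, near-zero cost
    return heavy_model(x)            # pay the full inference cost
```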
We propose a novel scene representation that encodes reaching distance: the distance between any pair of positions in the scene along feasible trajectories. We demonstrate that this environment field representation can directly guide the dynamic behavior of agents in 2D mazes or 3D indoor scenes. Our environment field is a continuous representation, learned as a neural implicit function from discretely sampled training data. We showcase its applications in agent navigation in 2D mazes and human trajectory prediction in 3D indoor environments. To produce physically plausible and natural trajectories for humans, we additionally learn a generative model that predicts the regions where humans commonly appear, and enforce the environment field to be defined within those regions. Extensive experiments demonstrate that the proposed method can generate feasible and plausible trajectories efficiently and accurately.
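A hedged sketch of how such an environment field might look in code: an MLP maps a (position, goal) pair to a reaching distance, and an agent steps along the field's negative gradient. The network sizes, the greedy stepping rule, and all names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class EnvironmentField(nn.Module):
    """Neural implicit function: (position, goal) -> reaching distance."""
    def __init__(self, dim: int = 2, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, pos: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([pos, goal], dim=-1))

def navigate_step(field, pos, goal, step_size: float = 0.05):
    """Move in the direction that decreases the predicted reaching distance."""
    pos = pos.clone().requires_grad_(True)
    dist = field(pos, goal)
    (grad,) = torch.autograd.grad(dist.sum(), pos)
    return (pos - step_size * grad / (grad.norm() + 1e-8)).detach()
```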
Vector graphic documents present multiple visual elements, such as images, shapes, and text. Choosing suitable colors for these elements is a difficult yet crucial task for both amateurs and professional designers. Instead of creating a single palette for all elements, we extract multiple palettes, one from each visual element in a graphic document, and then combine them into a color sequence. We propose a masked color model for color sequence completion, which recommends colors for the masked positions with high probability based on the color context across the multiple palettes. We train the model and build a color recommendation system on a large-scale dataset of vector graphic documents. The proposed method outperforms other state-of-the-art methods in both quantitative and qualitative evaluations of color prediction, and our color recommendation system received positive feedback from professional designers in an interview study.
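The masked color model is described in BERT-like terms, so a minimal sketch might look as follows (sizes, vocabulary, and the omission of positional encodings are all our simplifications): quantized palette colors form one sequence, one slot is masked, and a Transformer encoder scores candidate colors for it.

```python
import torch
import torch.nn as nn

class MaskedColorModel(nn.Module):
    def __init__(self, n_colors: int = 512, d_model: int = 128, n_layers: int = 4):
        super().__init__()
        self.mask_id = n_colors  # extra [MASK] token id
        self.embed = nn.Embedding(n_colors + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, n_colors)

    def forward(self, color_ids: torch.Tensor) -> torch.Tensor:
        # color_ids: (batch, seq_len) quantized colors; positional
        # encodings are omitted for brevity in this sketch.
        return self.out(self.encoder(self.embed(color_ids)))

model = MaskedColorModel()
seq = torch.tensor([[12, 87, model.mask_id, 301]])  # palette colors + one masked slot
top5 = model(seq)[0, 2].topk(5).indices             # 5 recommended candidate colors
```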
Unlike 2D object detection, where all RoI features come from grid pixels, RoI feature extraction for 3D point cloud object detection is far more diverse. In this paper, we first compare and analyze the differences in structure and performance between two state-of-the-art models, PV-RCNN and Voxel-RCNN. We find that the performance gap between the two models stems not from point information but from structural information. Voxel features contain more structural information because they quantize the point cloud rather than downsample it, so they essentially retain the complete information of the whole point cloud. This strong structural information gives the detector higher performance in our experiments, even though voxel features lack accurate location information. We therefore argue that structural information is the key to 3D object detection. Based on this conclusion, we propose a Self-Attention RoI Feature Extractor (SARFE) to enhance the structural information of features extracted from 3D proposals. SARFE is a plug-and-play module that can easily be used with existing 3D detectors. We evaluate SARFE on the KITTI dataset and the Waymo Open Dataset. With the newly introduced SARFE, we improve the performance of state-of-the-art 3D detectors by a large margin on the cyclist class of the KITTI dataset while maintaining real-time capability.
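Since SARFE is announced as plug-and-play, one plausible shape for it is a residual self-attention layer over the grid features of each RoI. The internals below are our assumption; only the module's name and role come from the abstract.

```python
import torch
import torch.nn as nn

class SARFE(nn.Module):
    """Self-Attention RoI Feature Extractor sketch: each grid feature of a
    3D proposal attends to all others, enriching structural information."""
    def __init__(self, channels: int = 128, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        # roi_feats: (num_rois, grid_points, channels) from any 3D detector.
        ctx, _ = self.attn(roi_feats, roi_feats, roi_feats)
        return self.norm(roi_feats + ctx)  # residual keeps the original features
```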
Graph neural networks (GNNs) are deep learning models that take graph data as input, and they are applied to various tasks such as traffic prediction and molecular property prediction. However, owing to the complexity of GNNs, it is difficult to analyze which parts of the input affect a GNN model's output. In this study, we extend explanation methods for convolutional neural networks (CNNs), such as local interpretable model-agnostic explanations (LIME), gradient-based saliency maps, and gradient-weighted class activation mapping (Grad-CAM), to GNNs, and predict which edges in the input graph are important for the GNN's decisions. Experimental results show that the LIME-based approach is the most effective explainability method across multiple tasks in realistic situations, even outperforming the state-of-the-art method in GNN explainability.
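As a concrete reference point, the cheapest perturbation-based relative of these edge-attribution methods can be written as below. Note the hedge: LIME proper fits a local surrogate model over many random perturbations; this single-edge occlusion is only its simplest cousin, and the interface of `gnn` is an assumption.

```python
import torch

def edge_importance(gnn, x, edge_index, target_class):
    """Score each edge by how much removing it changes the prediction.
    edge_index: (2, E) tensor of edges; returns one score per edge."""
    with torch.no_grad():
        base = gnn(x, edge_index)[0, target_class]
        scores = []
        for e in range(edge_index.size(1)):
            keep = torch.arange(edge_index.size(1)) != e
            pred = gnn(x, edge_index[:, keep])[0, target_class]
            scores.append((base - pred).item())  # large drop => important edge
    return torch.tensor(scores)
```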
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, yet its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even when the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
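A hedged sketch of the implicit-alignment idea as we read it (module names, sizes, and the DETR-style decoder are assumptions; see the released code for the real design): 3D coordinates are encoded into both token streams before a shared transformer decoder consumes them.

```python
import torch
import torch.nn as nn

class CoordEncoder(nn.Module):
    """Adds an MLP encoding of 3D coordinates to each token."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, tokens, coords3d):
        # tokens: (B, N, d_model); coords3d: (B, N, 3) per-token 3D points.
        return tokens + self.mlp(coords3d)  # alignment happens in feature space

def fuse(enc, decoder, queries, img_tokens, img_xyz, pts_tokens, pts_xyz):
    """decoder: any DETR-style decoder module taking (queries, memory)."""
    memory = torch.cat([enc(img_tokens, img_xyz),
                        enc(pts_tokens, pts_xyz)], dim=1)
    return decoder(queries, memory)  # per-query 3D box predictions follow
```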
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
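To make the threat model concrete, here is a minimal sketch of the NAIVEATTACK step as the abstract describes it (the trigger shape, poison rate, and relabeling policy are our illustrative assumptions): stamp a trigger on a fraction of the raw images before distillation, so the backdoor is baked into the synthetic set that victims later train on.

```python
import torch

def naive_attack(images, labels, target_class, rate=0.1, patch=3):
    """images: (N, C, H, W) raw data fed to distillation; returns a poisoned copy."""
    poisoned, labels = images.clone(), labels.clone()
    idx = torch.randperm(len(images))[: int(rate * len(images))]
    poisoned[idx, :, -patch:, -patch:] = 1.0  # white corner-square trigger
    labels[idx] = target_class                # point the trigger at the attacker's class
    return poisoned, labels
# DOORPING, by contrast, re-optimizes the trigger throughout distillation.
```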
Automatic music generation with artificial intelligence typically requires a large amount of data, which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by fine-tuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while a model that is not pre-trained (Transformer) shows no such ability beyond naive repetition. Evaluating generated music is a challenging task, and even more so for drum grooves, which have little precedent in the literature. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 against those played by human professionals, exposing the strengths and weaknesses of generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
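For intuition, one simple way to make drum MIDI digestible for a text-pre-trained LLM is to serialize hits and timing gaps into tokens. The token format below is our assumption; the paper's actual encoding may differ.

```python
def drum_events_to_text(events):
    """events: list of (time_step, drum_name) tuples from a MIDI parse."""
    tokens, prev = [], 0
    for step, drum in sorted(events):
        if step > prev:
            tokens.append(f"WAIT_{step - prev}")  # encode the timing gap
        tokens.append(f"HIT_{drum}")
        prev = step
    return " ".join(tokens)

line = drum_events_to_text([(0, "KICK"), (4, "SNARE"), (8, "KICK")])
# -> "HIT_KICK WAIT_4 HIT_SNARE WAIT_4 HIT_KICK"; one such line per groove
# becomes a fine-tuning example for the language model.
```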
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes from only a few support examples. In this work, we explore a simple yet unified solution for FSIS and its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully exploit the relationship between support and query features within a Transformer-like framework. Our key insights are twofold. First, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Second, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice, from two aspects: feature level and instance level. In particular, we first design a mask-based dynamic weighting module to enhance support features, and then propose to link object queries for better calibration via cross-attention. After these steps, performance on novel classes improves significantly over our strong baseline. Additionally, our new framework can easily be extended to incremental FSIS with minor modifications. When benchmarked on the COCO dataset under the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shot counts; e.g., we boost nAP by a noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
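A hedged sketch of the first insight (masked pooling and channel gating are our reading of the abstract; shapes and names are illustrative): the support mask pools a class center from the support features, which then re-weights the query features channel-wise.

```python
import torch

def dynamic_reweight(support_feat, support_mask, query_feat):
    """support_feat: (C, H, W); support_mask: (H, W) binary; query_feat: (C, H2, W2)."""
    m = support_mask.unsqueeze(0)                                   # (1, H, W)
    center = (support_feat * m).sum((1, 2)) / m.sum().clamp(min=1)  # (C,) class center
    gate = torch.sigmoid(center).view(-1, 1, 1)                     # channel-wise weights
    return query_feat * gate                                        # re-weighted query
```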